This is the README for the code package for "Smart Matching Platforms and Heterogeneous Beliefs in Centralized School Choice"

########### CODE ###########

---- Chile ----

Code for the Chile part of the paper is implemented using Stata v16.1, Matlab R2020a, and LaTeX.

There is one master file in Matlab (/code/chile/main.m) that generates all the figures and tables related to Chile. 
This Matlab file calls other m files that generate each figure/table in the paper.
Comments in main.m identify the code generating each exhibit. 

REQUISITES:
1) Stata must be installed.
2) The rdrobust package must be installed in Stata (run "ssc install rdrobust" in Stata).
3) LaTeX must be installed.
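
As a quick pre-flight check, the sketch below looks for the required programs on the system PATH. It is written in Python, which this package does not otherwise require. The executable names are assumptions that vary by platform and Stata edition, so adjust them to your installation. The check does not verify that rdrobust is installed inside Stata.

```python
import shutil
from typing import Dict, List, Optional

# Candidate executable names are assumptions; edit them to match your system.
CANDIDATES: Dict[str, List[str]] = {
    "Stata": ["stata-mp", "stata-se", "stata", "StataMP", "StataSE"],
    "Matlab": ["matlab"],
    "LaTeX": ["pdflatex", "latex"],
}


def find_tools(candidates: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Return, for each tool, the first candidate found on PATH (or None)."""
    found: Dict[str, Optional[str]] = {}
    for tool, names in candidates.items():
        found[tool] = next(
            (path for path in map(shutil.which, names) if path is not None),
            None,
        )
    return found


if __name__ == "__main__":
    for tool, path in find_tools(CANDIDATES).items():
        print(f"{tool}: {path if path else 'NOT FOUND on PATH'}")
```

A "NOT FOUND" result only means the program is not reachable from the command line; it may still be installed as a desktop application.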

STEPS:
1) Open up main.m.
2) Replace 'your_pc_name' in line 49 with the user name associated with your session. Running 
	"char(java.lang.System.getProperty('user.name'))" in Matlab returns the value to enter there.
3) Modify the paths between lines 52 and 65 to the appropriate locations on your computer.
4) Run main.m.

A note on reproducibility: these m files were run on macOS. Running them on a PC may produce figure formatting errors, though
the numerical content of the tables and figures should remain identical.

----  New Haven Public Schools (NHPS) ----

Code for the New Haven part of the paper is implemented using Stata v16.1. 

Master.do can be found in the /code/NHPS/ subfolder. This .do file calls all of the project's NHPS-related do files. 
When executed, the entirety of the NHPS portion of the code is run.

The code is broken down into two subsections: /clean/ and /analysis/. /clean/ contains 4 .do files, which import, clean, and
append the relevant raw data sources obtained from the NHPS (see DATA section). The final analysis dataset, 
/data/intermediate_data/student-level-panel, is identified at the student by year level.

The dataset is then analyzed by the 5 .do files contained in /analysis/. These do files produce every NHPS-related figure and
table in the paper, as well as the NHPS table notes. Each .do file states the input and output files at the top, as well as a
brief description of its purpose. 

Here is a list of each NHPS-related figure/table in the paper and the .do file that produces it:

> Figure VIII - 04_did_plots_nolines.do
> Table K.I - 02_sample_descriptives.do
> Table K.II - 03_regression_tables.do
> Table K.III - 01_balance_table.do
> Figure K.III - 05_risk_binscatters.do
> Figure K.IV - 04_did_plots_nolines.do
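
For scripted use, the list above can be written as a lookup table. The following Python sketch is illustrative only and is not part of this package. It copies the exhibit and file names from the list and groups the exhibits by the .do file that produces them; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Exhibit -> .do file, copied from the list above.
EXHIBIT_TO_DO_FILE: Dict[str, str] = {
    "Figure VIII": "04_did_plots_nolines.do",
    "Table K.I": "02_sample_descriptives.do",
    "Table K.II": "03_regression_tables.do",
    "Table K.III": "01_balance_table.do",
    "Figure K.III": "05_risk_binscatters.do",
    "Figure K.IV": "04_did_plots_nolines.do",
}


def exhibits_by_do_file(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Group exhibits by the .do file that produces them, sorted by file name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for exhibit, do_file in mapping.items():
        grouped[do_file].append(exhibit)
    return dict(sorted(grouped.items()))


if __name__ == "__main__":
    for do_file, exhibits in exhibits_by_do_file(EXHIBIT_TO_DO_FILE).items():
        print(f"{do_file}: {', '.join(exhibits)}")
```

Note that 04_did_plots_nolines.do is the only file that produces two exhibits.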

########### DATA ###########

---- Chile ----

The Chilean data that we use for this project are confidential. We are not permitted to post them online. We obtained the records 
through an agreement with the Ministry of Education, and through ConsiliumBots, an NGO that provided the web service that allowed the
integration of our live feedback into the application process. The Ministry of Education provides open access to a dataset with the 
final application of each student, priorities, school capacities and all the other relevant inputs to replicate the assignment 
process (https://centroestudios.mineduc.cl/datos-abiertos/). This is available for every year in which the centralized system has 
been implemented. These datasets do not contain the application history, the predicted risk, or the assignment to any of the treatments
described in the paper; hence, they are not by themselves enough to replicate our findings. Contact the researchers for additional guidance. 

Here is a brief description of the raw inputs used in the Chilean part of the project:

2016:
inputRCT.mat - This is data on choice applications in 2016. It includes application descriptors such as length, measures of nonplacement risk, and attributes of schools. It describes initial applications, endline applications, and pre-treatment applications (where applicable), and descriptors of application modifications. It also includes treatment assignments as well as student and market attributes. 

2017-2021:
inputRD.mat - This file contains data on choice applications from 2017 through 2021. As in 2016, it includes application characteristics, treatment assignments, and student- and market-level descriptors. Data from 2020 and 2021 include treatment assignments from the WhatsApp RCTs run in those years. 

2020:
dataEncuestaMail.csv - This file contains the answers to the survey described in Online Appendix G.

Other datasets:

compilacionInfoGeoref_wNewMarket.csv: geo-reference and educational market for each school.
eqCodCursoACodNivelFake2019.csv: crosswalk between multiple ids of school programs.
FirmData.csv: value added, school expenditure and other school-level measures from Neilson (2021).
FirmData.dta: value added, school expenditure and other school-level measures from Neilson (2021).
llenos20XX.csv: classification of under/over subscribed schools after round 1 of the choice process, year 20XX.
nombresNewMarket2019.csv: label of the educational market.
oferta_1_2020.csv: seats for the 2020 choice process.
policy.mat: details of the rollout of the centralized school choice over time.
schoolsXs2020.csv: characteristics of schools in 2020.
simcePerGradeParchado2018.csv: last standardized test available.
unicosEnComunaYNoVolunatarios20XX.mat: classification of applicants who had to apply because they were enrolled in a terminal grade.

---- NHPS ----

The New Haven Public Schools data that we use for this project are confidential. We are not permitted to post them online. 
We obtained the records through an agreement with the New Haven Public Schools.  Researchers seeking to access these 
records should contact NHPS. The records we use are maintained by the Office of Choice and Enrollment. Contact 
information for the Office of Choice and Enrollment is available at their website, https://www.newhavenmagnetschools.com/ 
(current as of December 2021). Contact the researchers for additional guidance. 

Here is a brief description of the raw inputs used in the NHPS part of the project:

> students_geocodes.dta - this contains geocodes for the households in our sample.
> email_survey2019_clean.csv and email-survey2020-clean.csv - these contain responses to the lottery survey
> apps_change_panel_2020.csv - this contains daily snapshots of the set of all applications in 2020
> apps_changelog_2019.csv and apps_changelog_2020.csv - these are the logs of application changes
> email_logs.csv - this contains the status of warning emails sent to families
> randomizations.csv - this contains data on the treatment assignments
> ratex-table-2019.csv and ratex-table-2020.csv - these contain the risk profiles associated with different schools;
	see Online Appendix F for more details on risk calculation
> sim_use.csv - this contains data on use of the application simulator
> smartchoice_2019.csv, smartchoice_2020.csv, and smartchoice_all.csv - these contain student descriptives and lottery statuses
	at the student by market by priority level
> applications.csv - this contains records of applications at the student by market by priority level.
> cities.csv - this contains residential info
> parents.csv - this contains parents' contact info
> races.csv - this contains student race info
> students.csv - this contains student descriptives
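
Researchers who obtain the NHPS records may want to confirm that every raw input listed above is present before running Master.do. A minimal Python sketch follows; Python is not part of this package. The folder passed to the function is hypothetical, since the location of the raw data depends on your setup. The file names are copied from the list above.

```python
import sys
from pathlib import Path
from typing import Iterable, List

# Raw NHPS inputs, copied from the list above.
NHPS_RAW_INPUTS: List[str] = [
    "students_geocodes.dta",
    "email_survey2019_clean.csv",
    "email-survey2020-clean.csv",
    "apps_change_panel_2020.csv",
    "apps_changelog_2019.csv",
    "apps_changelog_2020.csv",
    "email_logs.csv",
    "randomizations.csv",
    "ratex-table-2019.csv",
    "ratex-table-2020.csv",
    "sim_use.csv",
    "smartchoice_2019.csv",
    "smartchoice_2020.csv",
    "smartchoice_all.csv",
    "applications.csv",
    "cities.csv",
    "parents.csv",
    "races.csv",
    "students.csv",
]


def missing_inputs(raw_dir: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are not present in raw_dir."""
    folder = Path(raw_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # The raw-data folder is an assumption; pass yours as the first argument.
    raw_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_inputs(raw_dir, NHPS_RAW_INPUTS)
    if missing:
        print("Missing raw inputs:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All listed raw inputs were found.")
```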